understand the five most commonly-used data structures in R
be able to create and manipulate these data structures
be familiar with the ‘tibbles’ data structure
6.2 Introduction
In R, “data structures” are the fundamental ways in which data is organised and stored for use in our analysis and modeling.
R has various types of data structures, each optimised for different kinds of tasks. In this tutorial we’ll identify the five most common data structures that you’re likely to use within sport data analytics, and explore how to create and manipulate data within these structures.
6.3 Type One: Matrices
A ‘matrix’ is a two-dimensional data structure.
It’s used to store and organise data in rows and columns, similar to a spreadsheet. It’s important to note that, just like vectors, all elements within a matrix must be of the same data type.
This is a very common data structure in R, so it’s good to be familiar with how to create and manipulate matrices.
Creating matrices
We can use the matrix() function to create a matrix by specifying the dataset, the number of rows, and the number of columns in the matrix.
In this example, we create two matrices. Can you figure out how the contents of each matrix are created based on an initial vector called [data]?
rm(list = ls()) # this code cleans my environment
data <- c(1, 2, 3, 4, 5, 6) # create data
matrix_1 <- matrix(data, nrow = 2, ncol = 3)
matrix_2 <- matrix(data, nrow = 3, ncol = 2)
# print these to console window
print(matrix_1)
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
print(matrix_2)
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
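As the output above shows, matrix() fills a matrix column by column by default. The following is a minimal sketch of how the optional byrow argument changes that; the object names by_column and by_row are only illustrative.

```r
data <- c(1, 2, 3, 4, 5, 6)

# Default behaviour: values are placed down each column in turn
by_column <- matrix(data, nrow = 2, ncol = 3)

# byrow = TRUE places values across each row in turn instead
by_row <- matrix(data, nrow = 2, ncol = 3, byrow = TRUE)

print(by_column)
print(by_row)

# dim() reports the number of rows, then the number of columns
print(dim(by_row))
```

Both matrices hold the same six values; only their arrangement differs.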
Accessing matrix elements
We use square brackets [ ] with row and column indices to access elements in a matrix.
Important
Remember: R uses 1-based indexing! So the first element has an index of 1. In some programming languages, 0-based indexing is used instead, meaning the first element has an index of 0.
If you run the following code, you’ll see how different elements of a matrix can be extracted.
# note that this assumes you ran the previous code to create matrix_1!
first_row_second_column <- matrix_1[1, 2]
entire_second_row <- matrix_1[2, ]
entire_third_column <- matrix_1[, 3]
print(first_row_second_column)
[1] 3
print(entire_second_row)
[1] 2 4 6
print(entire_third_column)
[1] 5 6
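Square brackets can also select several rows or columns at once. The sketch below is one possible illustration, using a small matrix called m that is created just for this example.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# A range of indices keeps several columns (here columns 1 and 2)
first_two_columns <- m[, 1:2]

# A negative index drops a column (here column 2)
without_column_two <- m[, -2]

# A logical condition returns the matching values as a vector
greater_than_three <- m[m > 3]

print(first_two_columns)
print(without_column_two)
print(greater_than_three)
```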
Modifying matrices
We can add or update elements in a matrix by assigning values using row and column indices:
matrix_1[1, 1] <- 42 # this changes the element at row 1, column 1 to 42
matrix_2[, 2] <- c(7, 8, 9) # this changes the contents of column 2 to 7,8,9
print(matrix_1)
[,1] [,2] [,3]
[1,] 42 3 5
[2,] 2 4 6
print(matrix_2)
[,1] [,2]
[1,] 1 7
[2,] 2 8
[3,] 3 9
Matrix operations
We can also perform arithmetic and logical operations on matrices, such as element-wise addition, subtraction, multiplication, and division:
A <- matrix(c(1, 2, 3, 4), nrow = 2) # create our first matrix
B <- matrix(c(5, 6, 7, 8), nrow = 2) # create our second matrix
sum_matrix <- A + B # create a new matrix which is the element-wise sum of the two matrices
product_matrix <- A * B # create a new matrix which is the element-wise product of the two matrices
print(sum_matrix)
[,1] [,2]
[1,] 6 10
[2,] 8 12
print(product_matrix)
[,1] [,2]
[1,] 5 21
[2,] 12 32
Matrix functions
We can apply functions to matrices to perform various operations, such as calculating the transposed matrix, row and column sums, and more:
Do you understand what it means to ‘transpose’ a matrix?
transpose_matrix <- t(A) # this transposes matrix A
row_sums <- rowSums(A)
col_sums <- colSums(A)
print(transpose_matrix)
[,1] [,2]
[1,] 1 2
[2,] 3 4
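The row and column sums are calculated above but not printed. As a small self-contained sketch (matrix A is recreated here so the example runs on its own), this shows what rowSums() and colSums() return:

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)

row_sums <- rowSums(A) # one total per row
col_sums <- colSums(A) # one total per column

print(row_sums) # 4 6
print(col_sums) # 3 7
```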
Matrix multiplication
Use the %*% operator to perform matrix multiplication (the * operator is element-wise):
multiplied_matrix <- A %*% t(B)
print(multiplied_matrix)
[,1] [,2]
[1,] 26 30
[2,] 38 44
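To make the difference between the two kinds of multiplication concrete, here is a minimal sketch that applies both operators to the same pair of matrices. In base R, * multiplies matching elements, while %*% computes the matrix product (rows of the first matrix combined with columns of the second).

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)

# Element-wise: each element of A is multiplied by the matching element of B
element_wise <- A * B

# Matrix product: each row of A is combined with each column of B
matrix_product <- A %*% B

print(element_wise)
print(matrix_product)
```

For %*% to work, the number of columns in the first matrix must equal the number of rows in the second.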
6.4 Type Two: Arrays
Like a matrix, an ‘array’ can also be used to store data in a structured manner.
While a matrix is a two-dimensional structure (with rows and columns), an array is multi-dimensional.
A one-dimensional array (often just called a “vector”) is similar to a single row or column of a matrix. A two-dimensional array is the same as a matrix, which we covered above.
But arrays can go beyond this, forming three-dimensional structures (often visualised as a cube of data), four-dimensional, and so on. This allows arrays to model more complex data relationships and structures that a simple matrix cannot.
Each element in an array, regardless of its dimension, can be accessed using a set of indices, where the number of indices corresponds to the number of dimensions. For example, in a three-dimensional array, an element would be accessed using three indices.
It is unlikely that you will need to deal with arrays, but it’s worth knowing that they are there if you need them!
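If you do want to experiment, the following is a minimal sketch of a three-dimensional array built with the base R array() function. The interpretation of the dimensions (2 players, 3 metrics, 4 matches) is purely hypothetical and only there to make the indices easier to picture.

```r
# 24 values arranged as 2 rows x 3 columns x 4 "layers"
# (hypothetically: 2 players, 3 metrics, 4 matches)
cube <- array(1:24, dim = c(2, 3, 4))

print(dim(cube))

# Three indices are needed to reach a single element
single_value <- cube[1, 2, 3]
print(single_value) # 15

# Leaving two indices blank returns a whole two-dimensional slice
first_layer <- cube[, , 1]
print(first_layer)
```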
6.5 Type Three: Lists
Lists are used to store and organise a collection of elements. Unlike vectors and matrices, lists can store elements of different data types and structures, such as numbers, characters, vectors, matrices, data frames, and even other lists.
Creating a list
You can use the list() function to create a list by combining elements:
rm(list = ls()) # this code cleans my environment
simple_list <- list(42, "celtic", TRUE)
nested_list <- list(number = 42, text = "hello", vector = c(1, 2, 3), matrix = matrix(1:4, nrow = 2))
When you run this code, look at your environment window, and click on [nested_list]. Do you see what the code above has created?
Accessing list elements
You can use double square brackets [[ ]] with an index or a name, or the dollar sign $ with a name, to access elements in a list:
first_element <- simple_list[[1]] # access using index
named_element <- nested_list$text # access using name
third_element <- nested_list$vector # access using name
Modifying lists
We can add, update, or remove elements by assigning values using indexing or names:
simple_list[[2]] <- "banana"
nested_list$new_element <- "Morton are great!"
nested_list$number <- NULL # removes the 'number' element
List operations
We can also perform operations on elements within a list using indexing or names to access them:
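A minimal sketch of this idea follows. The list called player and its contents are made up for illustration.

```r
# A hypothetical list describing one player
player <- list(name = "Player A", goals = c(2, 0, 1), fit = TRUE)

# Use a named element in a calculation
total_goals <- sum(player$goals)
print(total_goals) # 3

# Update an element using its current value
player$goals <- player$goals + 1

# sapply() applies a function to every element of the list
element_lengths <- sapply(player, length)
print(element_lengths)
```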
We can apply functions to lists to perform different operations, such as calculating the length of the list or extracting specific elements from it:
list_length <- length(simple_list) # returns the list length
first_two_elements <- simple_list[1:2] # returns the first two elements of the list
Converting lists
We can convert a list to other data structures using functions such as unlist(), as.data.frame(), or as.matrix(), as long as the list’s structure permits it:
simple_list <- list(1, 2, 3)
vector_from_list <- unlist(simple_list) # create a vector from a list
print(vector_from_list)
[1] 1 2 3
nested_list <- list(list(1, 2), list(3, 4, 5))
dataframe_from_list <- as.data.frame(nested_list) # create a dataframe from two lists
print(dataframe_from_list)
X1 X2 X3 X4 X5
1 1 2 3 4 5
6.6 Type Four: Data Frames
Data frames are a core data structure in R, and are used to store and organise data in a tabular format with rows and columns.
Before the introduction of tibbles (see below), they were the most common data structure encountered while using R.
Data frames are similar to matrices, but can store columns of different data types, making them ideal for handling datasets with mixed data types.
They closely resemble the way that data is stored in a spreadsheet application such as Excel, where we can have different types of data in different columns within our worksheet (for example, column A is player names, column B is their age).
Creating data frames
We use the data.frame() function to create a data frame by combining vectors or other data structures as columns:
rm(list = ls()) # this code cleans my environment
names <- c("Scotland", "England", "Wales") # create a vector of names
ages <- c(25, 30, 22) # create a vector of ages
heights <- c(165, 180, 172) # create a vector of heights
data <- data.frame(Name = names, Age = ages, Height = heights) # this creates a dataframe called [data], which includes all three vectors
print(data)
Name Age Height
1 Scotland 25 165
2 England 30 180
3 Wales 22 172
Accessing elements in a data frame
As with matrices, we can use square brackets [ ], double square brackets [[ ]], or the dollar sign with row and column indices or names to access elements, rows, or columns in our data frame.
For example:
first_row <- data[1, ]
age_column <- data$Age # note how we refer to a specific vector (variable) within the dataframe
third_row_second_column <- data[3, "Age"]
Modifying data frames
We can add, update, or remove elements, rows, or columns by assigning values using indexing or names.
data$Name[1] <- "Alicia" # change an element
data$Weight <- c(60, 85, 75) # add a new column
data[4, ] <- c("David", 23, 185, 80) # add a new row (c() mixes text and numbers, so every column becomes character)
data$Weight <- NULL # remove the 'Weight' column
Data frame operations
We can also perform operations on elements, rows, or columns within a data frame using indexing or names to access them:
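data$Age <- as.numeric(data$Age) # the new row was added with c(), which stored every column as text, so we convert Age back to numeric
data$Height <- as.numeric(data$Height) # do the same for Height before comparing it with a number
avg_age <- mean(data$Age) # we can then do some calculations on it
tall_people <- data[data$Height > 175, ]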
Data frame functions
We can apply functions to data frames to perform various operations, such as calculating the dimensions, extracting specific elements, and more:
num_rows <- nrow(data) # this function (nrow) tells us how many rows are in our data frame
num_columns <- ncol(data)
column_names <- colnames(data)
row_names <- rownames(data)
Subsetting (filtering) data frames
We can use logical conditions, column indices, or column names to filter or subset data frames:
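As a minimal sketch of these three approaches, the example below rebuilds the small data frame from earlier in this section so that it runs on its own. The names of the result objects are only illustrative.

```r
data <- data.frame(
  Name = c("Scotland", "England", "Wales"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)

# Logical condition: keep only the rows where Age is above 24
older_than_24 <- data[data$Age > 24, ]

# Column names: keep only the Name and Height columns
name_and_height <- data[, c("Name", "Height")]

# Column indices: keep only the first two columns
first_two_columns <- data[, 1:2]

# subset() lets us combine conditions without repeating data$
tall_and_older <- subset(data, Age > 24 & Height > 170)

print(older_than_24)
print(tall_and_older)
```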
We can also use this approach to remove a variable from a data frame:
data_02 <- subset(data, select = -c(Age)) # creates a new data frame without variable [Age]
6.7 Type Five: Tibbles
‘Tibbles’ are a fairly recent introduction to R, provided by the tibble package, which is part of the tidyverse. They’re intended to make data manipulation more straightforward, and you will increasingly see them being used in preference to the older ‘data frame’ structure.
Tibbles offer several improvements over data frames, such as better printing in the console, the ability to handle column names with special characters or spaces, and automatic data type detection.
Tibbles are an integral part of the tidyverse package and work well with other tidyverse functions and packages.
As with all additional packages, you need to install and load the tidyverse package before you can use tibbles:
rm(list = ls()) # this code cleans my environment
library(tidyverse) # assumes you've installed tidyverse!
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Creating tibbles
We can use the tibble() function to create a tibble, by combining vectors or other data structures as columns:
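A minimal sketch of creating a tibble is shown below. It assumes the tibble package is installed (it is loaded automatically by library(tidyverse)) and reuses the same illustrative values as the data frame section; the object name tb is a choice made here for the example.

```r
library(tibble) # also attached by library(tidyverse)

# Build a tibble column by column, just like data.frame()
tb <- tibble(
  Name = c("Scotland", "England", "Wales"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)
print(tb)

# An existing data frame can be converted with as_tibble()
tb_from_df <- as_tibble(data.frame(x = 1:3, y = c("a", "b", "c")))
print(tb_from_df)
```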
Why would you want to use tibbles rather than data frames? Well, there are a few reasons:
Tibbles have a refined print method that shows only the first 10 rows and all the columns that fit on screen, making them much easier to work with for large datasets.
Unlike data frames, tibbles do not simplify the results of subsetting operations into the lowest possible dimension; they always return another tibble. This means you won’t unexpectedly get a vector when you thought you were working with a data frame.
Tibbles allow column names that don’t meet R’s variable naming rules, like those that don’t start with a letter, or those that include spaces. This can be useful when working with datasets that have unusual column names.
Tibbles are more “lazy” than data frames, in the sense that they do less: they never change the types of your inputs or the names of your columns, and they don’t partially match column names. They are also stricter, for example warning you when you try to access a column that doesn’t exist. This helps prevent some common data cleaning and manipulation errors.
If a single column is selected from a data frame using single square brackets (for example, data[, 1]), the data frame returns a vector. Tibbles, however, always return a tibble, which provides more consistent behaviour.
You will find that tibbles can provide more robust, predictable, and user-friendly behaviour than traditional data frames, particularly when dealing with large or complex datasets.
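Two of the differences listed above can be checked directly. The following is a small sketch, assuming the tibble package is installed; all object names are illustrative.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting one column with single square brackets:
df_col <- df[, "x"] # the data frame simplifies the result to a vector
tb_col <- tb[, "x"] # the tibble returns a one-column tibble

print(class(df_col))
print(class(tb_col))

# Column names that break R's naming rules need backticks when typed
non_syntactic <- tibble(`Goals scored` = c(2, 0, 1), `2024` = c(10, 12, 9))
print(names(non_syntactic))
```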
Accessing tibble elements
Similar to data frames, use square brackets [ ], double square brackets [[ ]], or the dollar sign with row and column indices or names to access elements, rows, or columns in a tibble:
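The sketch below illustrates these access methods on a small tibble. It assumes the tibble package is installed and recreates an illustrative tibble called tb so that the example is self-contained.

```r
library(tibble)

tb <- tibble(
  Name = c("Scotland", "England", "Wales"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)

first_row <- tb[1, ]            # a one-row tibble
age_column <- tb$Age            # a plain numeric vector
height_column <- tb[["Height"]] # also a plain numeric vector
wales_age <- tb[3, "Age"]       # a 1 x 1 tibble, not a bare number
wales_age_value <- tb$Age[3]    # the number itself

print(first_row)
print(wales_age_value) # 22
```

Note that single square brackets keep the result as a tibble, while $ and [[ ]] extract the underlying vector.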
Modifying tibbles
We can add, update, or remove elements, rows, or columns by assigning values using indexing or names:
tb$Name[1] <- "Alicia"
tb$Weight <- c(60, 85, 75) # Add a new column
tb <- add_row(tb, Name = "David", Age = 23, Height = 185, Weight = 80) # Add a new row
tb$Weight <- NULL # Remove the 'Weight' column
Tibble operations
We can perform operations on elements, rows, or columns within a tibble using indexing or names to access them:
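As a minimal sketch, the example below runs a few such operations on a freshly created tibble. It assumes the tibble package is installed; the Height_m column and the object names are invented for illustration.

```r
library(tibble)

tb <- tibble(
  Name = c("Scotland", "England", "Wales"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)

# Summarise a column
avg_age <- mean(tb$Age)
print(avg_age)

# Keep only the rows that meet a condition
tall_rows <- tb[tb$Height > 170, ]
print(tall_rows)

# Create a new column from an existing one (height in metres)
tb$Height_m <- tb$Height / 100
print(tb)
```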
This introduction to tibbles has really just scratched the surface of this data structure. Hadley Wickham has provided excellent and comprehensive coverage here.